Abstract

In this paper we consider the wavelet synopsis construction problem without the restriction that we 
only choose a subset of coefficients of the original data. We provide the first near optimal algorithm. 

We arrive at the above algorithm by considering space efficient algorithms for the restricted version 
of the problem. In this context we improve previous algorithms by almost a linear factor and reduce 
the required space to almost linear. Our techniques also extend to histogram construction, and improve 
the space-running time tradeoffs for V-Opt and range query histograms. We believe the idea applies to 
a broad range of dynamic programs and demonstrate it by showing improvements in a knapsack-like 
setting seen in construction of Extended Wavelets. 

1 Introduction 

Wavelet synopsis techniques have become extremely popular in query optimization, approximate query
answering and a large number of decision support systems. Wavelets, especially Haar wavelets, are one-one
mappings and admit a natural multi-resolution interpretation, as well as fast algorithms for the forward and
inverse transforms.

Given a set of $n$ numbers $X = x_1, \ldots, x_n$, the wavelet synopsis construction problem seeks to choose
a synopsis vector $Z$ with at most $B$ non-zero entries, such that the inverse wavelet transform of $Z$ (denoted
by $W^{-1}(Z)$) gives a good estimate of the data. The typical objective measures are (suitably weighted) $\ell_k$
norms of $X - W^{-1}(Z)$. An early paper [15] demonstrated a number of different applications for wavelet
synopses and proposed greedy algorithms. However, for objective measures other than the $\ell_2$ measure, the
greedy algorithm does not necessarily provide the optimum solution. The problem is quite non-trivial,
primarily due to the fact that the wavelet basis vectors overlap and cancellations (subtractions) occur. This
means that we can have two coefficients that cancel each other out, leaving a significantly (exponentially)
smaller contribution, which needs to be accounted for. The precision of the coefficients in the optimum
solution can be much larger than the precision of the data. In fact there are no known bounds or promising
techniques for quantifying the precision; this is the biggest stumbling block in the synopsis construction.

Most of the literature focuses on the Restricted case where the non-zero entries of Z are equal to the 
corresponding entries in the transform of the original data, W(X). A natural question remains: why should 
we be optimizing under the restriction of retaining the coefficients of the data — with no guarantees that 
such a restriction does not compromise the quality of the final synopsis? This is clearly suboptimal - a 
comparable example would be to optimize the synopsis for point queries, and use it for range queries. 

A simple example renders the discussion concrete: $X = \{1, 2, 3, 7\}$ and $B = 1$ illustrates that choosing
any single coefficient of $W(X) = \{3.25, -1.75, -0.5, -2\}$ (non-normalized) does not give the optimum
answer for the $\ell_1$ or $\ell_\infty$ norm. Normalization does not help. The normalized transform is $\{4.55, -2.45, -0.5, -2\}$,
but choosing the first coefficient as 4.55 in the normalized setting implies assigning $4.55/\sqrt{2} \approx 3.25$
everywhere. Thus dynamic programming approaches that seek to see the effect of the coefficient on the data come
to the same conclusion in both settings. The optimum choices of $Z$ are $\{z, 0, 0, 0\}$ for any $2 < z < 3$ and
$\{4, 0, 0, 0\}$ for $\ell_1$ and $\ell_\infty$ respectively. The same example applies to weighted $\ell_2$: for a suitable choice of weights,
the best error achieved by retaining any single entry of $W(X)$ is 5.78 whereas $Z = \{4.65, 0, 0, 0\}$
gives an error of 4.87. The example can be extended to any $B$ (by repetition and scaling). The restriction of
only retaining the coefficients of the data is significantly self-defeating.
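The example can be checked numerically. The following is a minimal Python sketch, assuming the non-normalized Haar transform (pairwise averages and half-differences); the function names `haar`, `inverse_haar`, `lk_error` and `best_restricted_single` are illustrative and not taken from the paper.

```python
import math

def haar(x):
    """Non-normalized Haar transform: pairwise averages and half-differences."""
    coeffs, cur = [], list(x)
    while len(cur) > 1:
        avg = [(cur[i] + cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        diff = [(cur[i] - cur[i + 1]) / 2 for i in range(0, len(cur), 2)]
        coeffs = diff + coeffs      # coarser differences go in front
        cur = avg
    return cur + coeffs             # overall average first

def inverse_haar(z):
    """Inverse transform: add the coefficient on the left half, subtract on the right."""
    cur, pos = [z[0]], 1
    while pos < len(z):
        diff = z[pos:pos + len(cur)]
        pos += len(cur)
        cur = [v for a, d in zip(cur, diff) for v in (a + d, a - d)]
    return cur

def lk_error(x, z, k):
    """l_k norm of x - W^{-1}(z); k may be math.inf."""
    d = [abs(a - b) for a, b in zip(x, inverse_haar(z))]
    return max(d) if k == math.inf else sum(e ** k for e in d) ** (1 / k)

def best_restricted_single(x, k):
    """Best error when exactly one coefficient of W(x) is retained."""
    w = haar(x)
    best = math.inf
    for i, c in enumerate(w):
        z = [0.0] * len(w)
        z[i] = c
        best = min(best, lk_error(x, z, k))
    return best

X = [1, 2, 3, 7]
print(haar(X))                                  # [3.25, -1.75, -0.5, -2.0]
print(best_restricted_single(X, math.inf))      # 3.75
print(lk_error(X, [4, 0, 0, 0], math.inf))      # 3.0
print(best_restricted_single(X, 1))             # 7.5
print(lk_error(X, [2.5, 0, 0, 0], 1))           # 7.0
```

In both norms the unrestricted single-coefficient synopsis beats every restricted one on this input.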

However, the restriction does ease the search for a solution and, as this paper shows, is an important
stepping stone towards the final result. For the restricted case, [5] gave a probabilistic scheme (the space
constraint is preserved in expectation only, along with the error) and very recently [4] gave an optimal
solution. This has been extended and improved in [17]. However, the solution to the unrestricted case has
remained elusive and we provide the first near optimal solutions. In the process, we also improve upon
previous algorithms for the restricted case. However, our algorithm is best explained by taking a
different path, which brings us to the major theme of the paper.

Synopsis construction is perhaps most relevant in the context of massive data sets. In some scenarios we can
justify that the synopsis is created using a "scratch" space larger than the synopsis and then stored. However,
a quadratic or extremely superlinear space complexity is nearly infeasible for large $n$. The dependence on
synopsis size $B$ is also important in this context: the smaller the dependence is, the larger is the synopsis
that can be computed in the environment of a particular system. Further, space is typically a more inflexible
resource, and not just a matter of waiting. However, a natural conceptual question arises: we are only given $n$
numbers, so do we really need to save so much information to compute the optimum answer?

All previous algorithms (for the restricted case) are expensive in space (see table below). This
superlinearity in $n$, $B$ is also seen in the context of histogram construction (we provide a detailed table in
Section 4.1). To avoid this expensive space complexity, several researchers have introduced the notion of
working space, which is the amount of space required to compute the error; the rest of the space is used
to construct the answer (coefficients, representatives, etc.). In the case of wavelets the working space used by
previous algorithms is $O(nB)$. In the case of histograms, known algorithms reconstruct the answer using only
the $O(n)$ working space, but with a penalty of an extra factor of $B$ in the running time. In this paper, we
reduce the space for wavelets and eliminate the penalty for histograms; in fact our results show that the
working space notion is not needed for a wide range of problems. To summarize our contributions:

• We provide the first near optimum algorithm for the wavelet synopsis construction problem. The 
algorithm naturally extends to multiple dimensions. 

• For the restricted case [5] provided approximation algorithms, however the space constraints were
obeyed in expectation. The results for (optimum) algorithms with strict space bounds are:

[4] also provided approximation algorithms for multiple dimensions; our techniques extend to this
context as well, and improve the running time and space by almost a factor $B$.



†In [17] the space bounds are not explicitly provided, but the total space appears to be $O(n^2 B/\log B)$ as well. The authors of
[14] consider the same problem for a non-Haar basis, which is excluded from the discussion here.
• We improve several histogram construction algorithms, e.g., V-Opt histograms and range query
histograms, by simultaneously achieving the best known running time and space bounds. The results
and a table comparing them are presented in Section 4.1. Due to lack of space we omit the
improvements for the range query histograms, which are similar.

• We believe the space efficient paradigm is applicable to other dynamic programs as well, and we
demonstrate the improvements in the case of Extended Wavelets in Section 4.2.



2 The Restricted (Haar) Wavelet Synopsis construction Problem 

We will work with non-normalized wavelet transforms where the inverse computation is simply adding the
coefficients that affect a coordinate*. The wavelet basis vectors are defined as (assume $n$ is a power of 2):

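$$\psi_1(j) = 1 \quad \text{for all } j$$

$$\psi_{2^s + t}(j) = \begin{cases} 1 & (t-1)\,n/2^s + 1 \le j \le t\,n/2^s - n/2^{s+1} \\ -1 & t\,n/2^s - n/2^{s+1} + 1 \le j \le t\,n/2^s \\ 0 & \text{otherwise} \end{cases} \qquad (1 \le t \le 2^s,\ 0 \le s < \log n)$$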

The above definitions ensure $W^{-1}(Z) = \sum_i Z_i \psi_i$. To compute $W(X)$, the algorithm computes the
average $\frac{x_{i+1} + x_{i+2}}{2}$ and the difference $\frac{x_{i+1} - x_{i+2}}{2}$ for each pair of consecutive elements as $i$ ranges over
$0, 2, 4, 6, \ldots$ The difference coefficients form the last $n/2$ entries of $W(X)$. The process is repeated on the
$n/2$ average coefficients; their difference coefficients yield the $(n/4 + 1), \ldots, (n/2)$'th coefficients of $W(X)$.
The process stops when we compute the overall average, which is the first element of $W(X)$. The wavelet
basis functions naturally form a complete binary tree since their support sets are nested and are of size
powers of 2 (with one additional node as a parent of the tree, see Figure 1(a)). The $x_j$ correspond to the
leaves, denoted by boxes, and the coefficients correspond to the non-leaf nodes of the tree. This tree of
coefficients is termed the error tree (following [4]). Likewise, assigning a value $c_i$ to the coefficient
corresponds to assigning $+c_i$ to all leaves $j$ that are left descendants (descendants of the left child) and $-c_i$
to all right descendants. The leaves that are descendants of a coefficient are termed the support of the
coefficient. Recall that the Restricted (Haar) Wavelet construction problem is: given a set of $n$ numbers
$X = x_1, \ldots, x_n$, choose at most $B$ terms from the wavelet representation $W(X)$ of $X$,
say denoted by $Z_R$, such that a (weighted) $\ell_k$ norm of $X - W^{-1}(Z_R)$ is minimized.



2.1 Reviewing Previous Algorithm(s) 

It is immediate that the value of $W^{-1}(Z_R)_j$ is fixed by the choices of all coefficients $i$ such that $j$ belongs
to the support of $i$. Suppose $S$ is a subset of the ancestors of a coefficient $i$. Thus a natural dynamic program
emerges where we define $E[i, b, S]$ to be the minimum contribution to the error from all $j$ in the support of
$i$, such that exactly $b$ coefficients that are descendants of $i$ are chosen along with the coefficients of $S$. The
algorithm is given in Figure 1(b). Clearly the number of entries in the array $E[\,]$ is $Bn$ times $2^r$ where $r$ is
the maximum number of ancestors of any node. It is easy to see that $r = \log n + 1$ and thus the number of
entries is $O(n^2 B)$. For the $\ell_1$ measure we need to spend $O(B)$ time in the minimization, giving a running time of
$O(n^2 B^2)$. For $\ell_\infty$, we may perform binary search and only need $\log B$ time (see [4]).



2.2 A Simple Improvement 

Observation 1 A node $i$ at level $t_i$ can have at most $2^{t_i} - 1$ descendants. Thus $E[i, b, S]$ is meaningful only
for $2^{t_i}$ values of $b$ (including $b = 0$). Further, the number of nodes at level $t_i$ is $\lceil n/2^{t_i} \rceil$ and the number of
possible subsets of ancestors of a node is $2^{\log n + 1 - t_i}$.

*For normalized wavelets the normalization constant appears both in the forward and inverse transform; all the results in the paper
will carry over to that setting as well, with the introduction of the normalization constants at several places.
At each internal node $i$, to compute $E[i, b, S]$:

• We determine if we are choosing the coefficient $i$.

• Assuming we are, we decide how the remaining $b - 1$ coefficients
are to be allocated between the two subtrees. If the children are $i_L$
and $i_R$, we are interested in

$$\min_{b'} E[i_L, b', S \cup \{i\}] + E[i_R, b - 1 - b', S \cup \{i\}]$$

• Assuming that we do not choose $i$, we are interested in a similar
expression, giving the overall minimization to be

$$\min \begin{cases} \min_{b'} E[i_L, b', S \cup \{i\}] + E[i_R, b - 1 - b', S \cup \{i\}] \\ \min_{b'} E[i_L, b', S] + E[i_R, b - b', S] \end{cases}$$

Figure 1: (a) The Error Tree and (b) the previous algorithm.



Thus the number of $E[\,]$ entries to fill corresponding to $i$ is $2^{\log n + 1 - t_i} \min\{B, 2^{t_i}\}$. The time taken is
$2^{\log n + 1 - t_i} \min\{B^2, 2^{2 t_i}\}$. Thus one way of computing the total time taken is

$$\sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} \min\{B^2, 2^{2 t_i}\} + B^2 = \sum_{t_i=1}^{\log B} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} 2^{2 t_i} + \sum_{t_i=\log B + 1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} B^2 + B^2$$

$$= \sum_{t_i=1}^{\log B} 2n^2 + \sum_{u=1}^{\log n - \log B} \frac{2n^2}{2^{2u} B^2} B^2 + B^2 = 2n^2 \log B + 2n^2 \sum_{u=1}^{\log n - \log B} \frac{1}{4^u} + B^2$$

which is $O(n^2 \log B)$. In the case of $\ell_\infty$, the expression $\sum_{t_i=1}^{\log n} \frac{n}{2^{t_i}} 2^{\log n + 1 - t_i} \min\{B \log B, t_i 2^{t_i}\} + B^2$ can be
shown to be $O(n^2)$ using the same scheme and change of variables as above.



2.3 The Intuition and the new algorithm 

The properties that stand out from the above dynamic program are

• There is no connection between $E[i, b, S]$ and $E[i, b', S']$ as long as $S \ne S'$.

• We do not need $E[i, b, S]$ while computing $E[i', b', S']$ unless $i$ is a child of $i'$ and either $S = S'$ or
$S = S' \cup \{i'\}$.

• And finally, there is no need to allocate space for $E[i, b, S]$ while computing $E[i', b', S']$ if $i$ is an
ancestor (not a descendant) of $i'$.

The simplest view of the new algorithm that computes the same table (but it is not stored in entirety at
any time) is a parallel algorithm, where there is a processor at each node of the error tree. The algorithm at
a node $i$ with children $i_L, i_R$ can be described as follows:

1. The node $i$ receives $S$ from its parent and seeks to return an array of size $B$ (or less) corresponding to
$E[i, b, S]$ for $0 \le b \le B$. It actually receives

$$v(i, S) = \sum_{i' \in S,\ i \text{ left descendant of } i'} c_{i'} \; - \sum_{i' \in S,\ i \text{ right descendant of } i'} c_{i'}$$

2. To evaluate $\min_{b'} E[i_L, b', S \cup \{i\}] + E[i_R, b - 1 - b', S \cup \{i\}]$ the node $i$ passes $S \cup \{i\}$ to both of
its children, i.e., $v(i_L, S \cup \{i\})$ and $v(i_R, S \cup \{i\})$. The children return the two arrays of size $B$ (or
less), and the $\min_{b'}$ is performed for each $b$. Note that the right child can reuse the same space needed
by the left child.

3. Now $i$ passes $S$ to the children and asks for $E[i_L, b, S]$ for all $b$ and likewise for $i_R$.

4. The node $i$ can now compute all $E[i, b, S]$. The entire time spent at this node is $\min\{2^{2 t_i}, B^2\}$.

5. If $i$ is the overall root, then $i$ also performs a minimization over all $b$ to find the solution with at most
$B$ coefficients.
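The per-node procedure can be simulated sequentially by a recursion that passes the accumulated ancestor value $v(i, S)$ down the tree and returns only a short array indexed by $b$. The following Python sketch is one such simulation under stated assumptions: it computes only the error (not the chosen coefficients), it represents a node by the half-open index range of its support, and the names `restricted_error`, `solve` and `join` are illustrative rather than taken from the paper.

```python
import math

def restricted_error(x, B, k=2):
    """Minimum l_k error (k may be math.inf) keeping at most B Haar coefficients of x.

    len(x) is assumed to be a power of 2. No table indexed by (i, b, S) is stored:
    each call returns one array indexed by b, and a node is visited once per
    subset S of its ancestors.
    """
    n = len(x)
    prefix = [0.0]
    for val in x:
        prefix.append(prefix[-1] + val)

    def mean(lo, hi):
        return (prefix[hi] - prefix[lo]) / (hi - lo)

    # l_infinity combines children with max; l_k combines k-th powers with +.
    join = max if k == math.inf else (lambda a, b: a + b)

    def solve(lo, hi, v):
        """Array E[b], b = 0..min(B, hi-lo-1), for the subtree over x[lo:hi],
        given the value v contributed by the chosen ancestors."""
        if hi - lo == 1:
            d = abs(x[lo] - v)
            return [d if k == math.inf else d ** k]
        mid = (lo + hi) // 2
        c = (mean(lo, mid) - mean(mid, hi)) / 2   # this node's coefficient in W(x)
        size = min(B, hi - lo - 1)
        best = [math.inf] * (size + 1)
        for used, delta in ((0, 0.0), (1, c)):    # skip or keep the coefficient
            if used > size:
                continue
            left = solve(lo, mid, v + delta)
            right = solve(mid, hi, v - delta)
            for bl, el in enumerate(left):
                for br, er in enumerate(right):
                    b = bl + br + used
                    if b <= size:
                        best[b] = min(best[b], join(el, er))
        return best

    total = math.inf
    for used, v in ((0, 0.0), (1, mean(0, n))):   # skip or keep the overall average
        for b, e in enumerate(solve(0, n, v)):
            if b + used <= B:
                total = min(total, e)
    return total if k == math.inf else total ** (1 / k)

print(restricted_error([1, 2, 3, 7], 1, math.inf))  # 3.75
print(restricted_error([1, 2, 3, 7], 1, 1))         # 7.5
```

At any moment only the arrays along one root-to-leaf path are alive, which is the source of the space saving described above.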

Lemma 1.1 No node receives the value $v(i, S)$ twice for the same set $S$.

The above shows that the algorithm is correct and runs in time $O(n^2 \log B)$ (and $O(n^2)$ for $\ell_\infty$). The next
lemma is also immediate from the description of the algorithm:

Lemma 1.2 The space required at node $i$ is $\min\{B, 2^{t_i}\}$, since this space is used for all $S$.

Thus the total space required is $O(B \log(n/B))$ (the last $\log B$ levels use geometrically decreasing space
which sums to $O(B)$, and $\log n - \log B = \log(n/B)$). Therefore if we consider the algorithm that simulates
the parallel algorithm, we can conclude with

Theorem 2 We can compute the error of the optimum $B$ term wavelet synopsis in time $O(n^2 \log B)$ (and $O(n^2)$
for $\ell_\infty$) using overall space $O(n + B \log(n/B)) = O(n)$.

Observe that we can only compute the error, and we do not know which coefficients are in the synopsis.

2.4 How do we find the coefficients?

We now show how to retrieve the coefficients after finding the total error. When we find the optimum error,
we also resolve (i) if the topmost coefficient is present or not and (ii) what is the allocation of the coefficients
to the left and right children. Armed with these two pieces of information, we simply recurse/recompute,
i.e., we pass the appropriate set (or $v(i, S)$ values) to the two children and their respective allocations. Each
child now finds the total error restricted to its subtree and each decides on the two pieces of information to
set up the recursive game.

Analysis: Let the running time of the recompute strategy be $f(n)$. To find the optimum error, we spend
$cn^2 \log B$ time and therefore we have the recursion:

$$f(n) = cn^2 \log B + 2f(n/2)$$

If we unroll the recursion one step, we see that $f(n) = cn^2 \log B + 2c(n/2)^2 \log B + 4f(n/4)$. We
can immediately observe that we are setting up a geometric sum and we can bound $f(n)$ by $2cn^2 \log B$.
Therefore we conclude:

Theorem 3 We can compute the complete solution, i.e., total error and the stored coefficients of the optimum
$B$ term wavelet synopsis in $O(n^2 \log B)$ time ($O(n^2)$ for $\ell_\infty$) using overall space $O(n + B \log(n/B))$.

Caveat: We have to be careful and ensure that when we output the coefficients recursively, we output all
the coefficients of the first half before outputting all the coefficients of the next half. In the process, we need
to remember the partition of the buckets, the parameter $b'$, for $\log n$ levels. But since we have to remember
only one number per level, the total space is $O(n + B \log(n/B) + \log n) = O(n + B \log(n/B))$.
3 Unrestricted Wavelet Synopsis construction Algorithms



We now show how to obtain an approximation algorithm for the general/unrestricted wavelet synopsis
construction problem. We focus our attention on $\ell_k$ error; we indicate the changes necessary for the weighted
case appropriately. Recall that the Wavelet synopsis problem is: Given a set of $n$ numbers $X = x_1, \ldots, x_n$,
find a $Z \in \mathbb{R}^n$ with at most $B$ non-zero entries such that $\|X - W^{-1}(Z)\|_k$ is minimized.

The following will be an important observation leading towards a suitable algorithm: if we examine the
previous algorithm based on assigning a processor to each coefficient in the error tree, we immediately
observe that if for different subsets of ancestors we receive the same value, i.e., $v(i, S) = v(i, S')$ for
$S' \ne S$, we need not redo the computation. Note that the savings cannot be guaranteed, and in order to
achieve the savings we have to increase the space bound.



Overview: The above will form a kernel of our algorithm for the (unrestricted) wavelet synopsis construc- 
tion problem. We would actually perform the computation for all possible, anticipated values of v(i, S). 
However, non-zero elements of Z can have any real value and it is not clear how to restrict the set of values. 

In what follows, we first describe the algorithm assuming that the wavelet coefficients belong to a set of
anticipated values $R$. Subsequently we describe how to determine $R$ and, more importantly, bound $|R|$.



3.1 The Algorithm 

Definition 3.1 Let $E[i, v, b]$ be the minimum possible contribution to the overall error from all descendants
of $i$ using exactly $b$ coefficients, under the assumption that the combined value of all ancestors chosen is $v$.

The overall answer is clearly $\min_b E[\text{root}, 0, b]$. A natural dynamic program is immediate: to compute
$E[i, v, b]$, if we decide the best choice is to allocate $b'$ coefficients to the left and let the $i$th coefficient be $r$,
then we need to add $E[i_L, v + r, b']$ and $E[i_R, v - r, b - b' - 1]$. The overall algorithm is:

1. The number of $b$ that are relevant to $i$ is $\min\{B, 2^{t_i}\}$. The node receives the $E[i_L, v', b']$, $E[i_R, v'', b'']$
from its children.

2. A non-root node computes $E[i, v, b]$ as follows:

$$E[i, v, b] = \min \begin{cases} \min_{r, b'} E[i_L, v + r, b'] + E[i_R, v - r, b - b' - 1] & i\text{th coefficient is } r \\ \min_{b'} E[i_L, v, b'] + E[i_R, v, b - b'] & i\text{th coefficient not chosen} \end{cases}$$

3. If $i$ is the root, then $i$ computes

$$\min_b \min \begin{cases} \min_{r, b'} E[i_L, r, b'] + E[i_R, r, b - b' - 1] & \text{root coefficient is } r \\ \min_{b'} E[i_L, 0, b'] + E[i_R, 0, b - b'] & \text{root coefficient not chosen} \end{cases}$$

Note that the root can figure out (i) the optimum error, (ii) if any coefficient corresponding to it is chosen
and (iii) the value $r$ of the coefficient. After the final solution is computed, we apply the recompute strategy,
and each node in the tree finds out if it has a coefficient in the answer and its value. The running time is

$$\sum_i |R| \min\{2^{t_i}, B\} \cdot |R| \min\{2^{t_i}, B\} = \sum_t |R|^2 \frac{n}{2^t} \min\{2^{2t}, B^2\} = O(|R|^2 n B)$$

For $\ell_\infty$ the bound is $\sum_t |R| \frac{n}{2^t} |R| \min\{t 2^t, B \log B\} = O(n |R|^2 \log^2 B)$. The required space can be
shown to be $O(|R| B \log(n/B))$ by ensuring that the computation resembles a post-order traversal of the tree
and that we do not retain the tables of the children nodes once we are done. Thus for each level we may need at most
2 tables of size $|R| \min\{B, 2^t\}$, which sums to the above.
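The dynamic program over anticipated values can be sketched in Python as follows. This is a minimal illustration, not the space-bounded algorithm itself: it assumes a caller-supplied finite candidate set `R`, memoizes on the incoming value $v$ with `functools.lru_cache` (so equal values reached through different ancestor choices are computed once, at the cost of unbounded cache space), and returns only the error. The names `unrestricted_error`, `solve` and `choices` are illustrative.

```python
import math
from functools import lru_cache

def unrestricted_error(x, B, R, k=math.inf):
    """Minimum l_k error over synopses with at most B non-zero coefficients,
    each non-zero coefficient drawn from the candidate set R.
    len(x) is assumed to be a power of 2."""
    n = len(x)
    join = max if k == math.inf else (lambda a, b: a + b)
    # (value, number of coefficients used); value 0 means "not chosen".
    choices = [(0.0, 0)] + [(r, 1) for r in R if r != 0]

    @lru_cache(maxsize=None)
    def solve(lo, hi, v):
        """Tuple E[b] for the subtree over x[lo:hi] given ancestor value v."""
        if hi - lo == 1:
            d = abs(x[lo] - v)
            return (d if k == math.inf else d ** k,)
        mid = (lo + hi) // 2
        size = min(B, hi - lo - 1)
        best = [math.inf] * (size + 1)
        for r, used in choices:
            if used > size:
                continue
            left = solve(lo, mid, v + r)     # left descendants receive +r
            right = solve(mid, hi, v - r)    # right descendants receive -r
            for bl, el in enumerate(left):
                for br, er in enumerate(right):
                    b = bl + br + used
                    if b <= size:
                        best[b] = min(best[b], join(el, er))
        return tuple(best)

    total = math.inf
    for r, used in choices:                  # the root coefficient adds r everywhere
        if used > B:
            continue
        for b, e in enumerate(solve(0, n, r)):
            if b + used <= B:
                total = min(total, e)
    return total if k == math.inf else total ** (1 / k)

R = [i * 0.5 for i in range(-16, 17)]        # hypothetical grid of multiples of 0.5
print(unrestricted_error([1, 2, 3, 7], 1, R, math.inf))  # 3.0
print(unrestricted_error([1, 2, 3, 7], 1, R, 1))         # 7.0
```

On the introductory example the sketch recovers the unrestricted optima (error 3 for $\ell_\infty$, error 7 for $\ell_1$), both better than any single retained coefficient of $W(X)$.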
3.2 Computing R

Lemma 3.1 If $\max_i |x_i| \le M$ then $\max_j |W(X)_j| \le M$.

Proof: The 1st coefficient is the average of all values and therefore cannot exceed $M$. Every other coefficient
is half the average value of the left half (of the support) minus half the average value of the right half. Each cannot
be more than $M$ in absolute value. ∎

Lemma 3.2 If the optimum solution is $Z^*$ then $\max_i |Z^*_i| \le 2 n^{1/k} M$.

Proof: If $\max_i |W^{-1}(Z^*)_i| > 2 n^{1/k} M$ then $\|X - W^{-1}(Z^*)\|_k \ge \|W^{-1}(Z^*)\|_k - \|X\|_k$ and

$$\|W^{-1}(Z^*)\|_k - \|X\|_k \ge \|W^{-1}(Z^*)\|_k - M n^{1/k} \ge \max_i |W^{-1}(Z^*)_i| - M n^{1/k} > M n^{1/k} \ge \|X\|_k$$

The all zero solution is a better solution, which is a contradiction. Now we apply Lemma 3.1 and get
$\max_i |W(W^{-1}(Z^*))_i| = \max_i |Z^*_i| \le 2 n^{1/k} M$, which proves the lemma. ∎
In the case of weighted $\ell_k$ the above bound on $\max_i |Z^*_i|$ is modified by a factor that depends on the weights. The next lemma follows from
the triangle inequality.

Lemma 3.3 If we round each non-zero value of the optimum $Z^*$ to the nearest multiple of $\delta$, thereby obtaining
$\hat{Z}$, then $\|X - W^{-1}(\hat{Z})\|_k \le \|X - W^{-1}(Z^*)\|_k + \delta n^{1/k}$ and $|R| = O(n^{1/k} M / \delta)$.

Therefore if we set $\delta = \epsilon M / n^{1/k}$ we can say that we have an additive approximation of $\epsilon M$ as well as
$|R| = O(n^{2/k} / \epsilon)$. Therefore we conclude the following:

Theorem 4 We can solve the Wavelet Synopsis Construction problem with $\ell_k$ error with an additive
approximation of $\epsilon M$, where $M = \max_i |x_i|$, in time $O(n^{1 + 4/k} B \epsilon^{-2})$ and space $O(n + n^{2/k} \epsilon^{-1} B \log(n/B))$. For
$\ell_\infty$ the running time is $O(n \epsilon^{-2} \log^2 B)$.



4 The theme of space efficiency and applications 

A natural paradigm emerges from inspecting the above: if we can compute the total error and the best
way to partition the problem into two halves of $n/2$ elements, we do not need to store the entire dynamic
programming table, and thereby save space. If we can compute the overall error in time $f(n) = A n^\alpha$
where $A$ is independent of $n$, then the time taken by the Recompute strategy is $g(n) = f(n) + 2g(n/2)$. The
solution to the recurrence is $O(A n^\alpha)$ if $\alpha > 1$ and $O(A n \log n)$ if $\alpha = 1$.
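The solution of the recurrence can be verified by unrolling it. The short derivation below is a worked check under the stated assumption $f(n) = A n^\alpha$, with $n$ a power of 2 and constant cost at the leaves ignored:

```latex
\begin{align*}
g(n) &= \sum_{j=0}^{\log n} 2^j f\!\left(\frac{n}{2^j}\right)
      = \sum_{j=0}^{\log n} 2^j A \left(\frac{n}{2^j}\right)^{\alpha}
      = A n^{\alpha} \sum_{j=0}^{\log n} 2^{-(\alpha - 1) j} \\[4pt]
\alpha > 1:&\quad g(n) \le \frac{A n^{\alpha}}{1 - 2^{-(\alpha - 1)}} = O(A n^{\alpha})
      \qquad (\text{e.g. } \alpha = 2 \text{ gives } g(n) \le 2 A n^{2}) \\[4pt]
\alpha = 1:&\quad g(n) = A n (\log n + 1) = O(A n \log n)
\end{align*}
```

For $\alpha = 2$ and $A = c \log B$ this reproduces the bound $2cn^2 \log B$ obtained for the wavelet recompute strategy.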

We demonstrate the above idea in two examples. First, we show its impact in space efficient V-Opt 
histogram construction. Second, we show the applicability in a new synopsis technique, Extended Wavelets. 

The idea also improves several results on range query histograms - however those algorithms are quite 
similar in spirit to the V-Opt histogram construction and we relegate the discussion to a fuller version of the 
paper. However the idea does help in reducing the space bound across the board - in fact for a large variety 
of problems it is immediate that the notion of working space, the space necessary to compute the value of the 
final answer, is not required any more. We can compute the entire answer, in the aforementioned working 
space. 
4.1 V-Opt Histograms



The V-Opt histogram is a classic problem in synopsis construction. Given a set of $n$ numbers $X =
x_1, \ldots, x_n$ the problem seeks to construct a $B$ piecewise constant representation $H$ such that $\|X - H\|_2$
(or its square) is minimized. Since their introduction in query optimization in [11], and subsequently in
approximate query answering ([1], among others), histograms have accumulated a rich history [12]. Several
different optimization criteria have been proposed for histogram construction, e.g., $\ell_1$, relative error, $\ell_\infty$, to
name a few. However most of them are based on a dynamic program similar to the V-Opt case. Thus the
V-Opt histograms provide an excellent foil to discuss all of the measures at the same time. As mentioned
in the introduction, [13] gave an $O(n^2 B)$ time algorithm to find the optimum histogram using $O(nB)$ space.
They observed that the space could be reduced to $O(n)$ at the expense of increasing the running time to
$O(n^2 B^2)$. The earlier data stream algorithms§ (extended in [8]) represent sparse dynamic tables, but the
space is still $O(B^2)$, quadratic in $B$. In those algorithms the $O(B^2)$ space performs a double role of
storing the coefficients as well as maintaining a frontier.

This is somewhat remedied by an approach where a robust wavelet representation of $O(B)$ coefficients is
constructed and then a dynamic program in the fashion of [13] or [9], restricted to the endpoints of the
support regions, is used. The dynamic program of [13] can be used to compute the answer in $O(B)$ space,
but with an extra factor of $B$ in running time. Therefore, irrespective of offline or streaming computation
there was a tradeoff between large space and an increased running time; this is the penalty referred to in the
introduction. This is the first paper which removes that penalty and gives an algorithm that simultaneously
achieves the best known space and time bounds.
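Source       Streaming  Factor          Time                                               Space                                                   Working space
[13]                    Opt             $O(n^2 B)$                                         $O(nB)$                                                 $O(n)$
[13]                    Opt             $O(n^2 B^2)$                                       $O(n)$                                                  $O(n)$
This Paper   No         Opt             $O(n^2 B)$                                         $O(n)$                                                  $O(n)$
This Paper   No         $(1+\epsilon)$  $O(n + B^3 (\epsilon^{-2} + \log n) \log n)$       $O(n + B \epsilon^{-1})$                                $O(n + B \epsilon^{-1})$
This Paper   Yes        $(1+\epsilon)$  $O(n + B^3 \epsilon^{-3} (\log 1/\epsilon) \log n)$  $O(B \epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$

Remaining entries for earlier approximation and streaming algorithms (one labeled "GH", streaming "Yes"), in the order in which they survive:

Time:           $O(n + (n/M) B^3 \epsilon^{-2} \log^3 n)$; $O(n + B^4 \epsilon^{-4} (\log 1/\epsilon) \log n)$; $O(n + B^4 \epsilon^{-3} (\log 1/\epsilon) \log n)$
Space:          $O(n + B^2 \epsilon^{-1})$; $O(M + B^2 \epsilon^{-1} \log n)$; $O(B \epsilon^{-2} (\log 1/\epsilon) \log n + B^2/\epsilon)$; $O(B \epsilon^{-2} (\log 1/\epsilon) \log n + B/\epsilon)$
Working space:  $O(n + B \epsilon^{-1})$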

Algorithm idea: Due to lack of space, we indicate the modification to the optimum algorithm. The
modifications to the approximation and streaming algorithms are similar. The optimal algorithm maintains $E[i, b]$
which is the minimum error of expressing the interval $[1, i]$ by at most $b$ buckets (intervals where the
representation is constant). A natural dynamic program arises: $E[i, b] = \min_{j < i} E[j, b - 1] + e(j + 1, i)$
where $e(j, i)$ is the minimum error of a single bucket¶. The running time is $O(n^2 B)$. If we are interested
in computing only the final answer, there is an $O(n)$ space algorithm which computes $E[i, 1]$ for all $i$, and
then extends that to $b = 2, 3$, etc.
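The error-only computation with two rolling arrays can be sketched as below. This is an illustrative Python version under stated assumptions: the error is the sum of squared errors (the square of the $\ell_2$ objective), single-bucket errors are obtained from prefix sums, and the names `vopt_error` and `sse` are not from the paper.

```python
def vopt_error(x, B):
    """Minimum sum of squared errors of a histogram of x with at most B buckets,
    using O(n) space (two arrays of length n + 1 plus prefix sums)."""
    n = len(x)
    s = [0.0] * (n + 1)   # prefix sums of x
    q = [0.0] * (n + 1)   # prefix sums of x squared
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        q[i + 1] = q[i] + v * v

    def sse(j, i):
        """e(j + 1, i): squared error of one bucket over x_{j+1}..x_i (1-indexed),
        achieved by representing the bucket by its mean."""
        t = s[i] - s[j]
        return (q[i] - q[j]) - t * t / (i - j)

    prev = [0.0] + [sse(0, i) for i in range(1, n + 1)]   # E[i, 1]
    for b in range(2, B + 1):
        cur = prev[:]                                     # "at most b" includes b - 1
        for i in range(2, n + 1):
            for j in range(1, i):
                cand = prev[j] + sse(j, i)                # E[j, b-1] + e(j+1, i)
                if cand < cur[i]:
                    cur[i] = cand
        prev = cur
    return prev[n]

print(vopt_error([1, 1, 5, 5], 1))   # 16.0
print(vopt_error([1, 1, 5, 5], 2))   # 0.0
```

Only the rows for $b - 1$ and $b$ are alive at any time, so the bucket boundaries themselves are not available from this computation.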

If $i \ge n/2$ we maintain $A[i]$ to be the starting point of the bucket that contains $x_{n/2}$ for the best
representation of $[1, i]$ by $b$ buckets, $B[i]$ to be the ending point of that interval, and $C[i]$ to be the
number of buckets used before $A[i]$. This requires $O(n)$ space, and is updated as shown below. Now, after
we compute $E[n, B]$ we can divide the problem into two parts, representing $[1, A[n] - 1]$ using $C[n]$ buckets and
$[B[n] + 1, n]$ by $B - C[n] - 1$ buckets. Note that each subproblem is defined on $n/2$ or fewer elements. Therefore
the Recompute strategy will run in time $O(n^2 B)$ as well and compute all the coefficients.

§Note that by the streaming model we refer to the "sorted" or "aggregate" model, most useful in time series data, where the
input is $x_i$ in increasing order of $i$. Only [6] applies to the general "turnstile" or "update" model, but seems to have high polynomial
dependence on $B \epsilon^{-1} \log n$. See [16] for more details on data stream models.

¶It is straightforward to show that the minimum error is achieved by the mean of $x_{j+1}, \ldots, x_i$.



1. $A[i] = 0$ if $i < n/2$ and 1 otherwise. $B[i] = 0$ if $i < n/2$ and $i$ otherwise. $C[i] = 0$ for all $i$.
2. For $b = 2$ to $B$ do
3.     For $i = 2$ to $n/2$ do
4.         $E[i, b] = \min_{j < i} E[j, b - 1] + e(j + 1, i)$
5.     For $i = n/2$ to $n$ do
6.         $E[i, b] = \min_{j < i} E[j, b - 1] + e(j + 1, i)$
7.         If $j$ (which achieved the minimum) $< n/2$ then $newA[i] = j + 1$, $newC[i] = b - 1$, $newB[i] = i$.
8.         else $newA[i] = A[j]$, $newB[i] = B[j]$, $newC[i] = C[j]$.
9.     $A \leftarrow newA$, $B \leftarrow newB$, $C \leftarrow newC$.
10. Recurse using $A[n]$, $B[n]$, $C[n]$ to compute the coefficients.



Figure 2: The 0(n) space optimum algorithm 
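A runnable version of this scheme, including the recursion on the two halves, might look as follows. It is a sketch under several assumptions rather than a transcription of Figure 2: indices are 0-based with half-open bucket ranges, the three arrays $A$, $B$, $C$ are merged into one array `cross` of triples, "at most $b$ buckets" semantics are used (so the bucket count stored for the left part is a budget), and the names `vopt_buckets`, `solve` and `cross` are illustrative.

```python
def vopt_buckets(x, B):
    """Optimal V-Opt bucket boundaries as half-open (start, end) pairs, found with
    O(n) space by tracking only the bucket that contains the middle element."""
    n = len(x)
    s = [0.0] * (n + 1)
    q = [0.0] * (n + 1)
    for i, v in enumerate(x):
        s[i + 1] = s[i] + v
        q[i + 1] = q[i] + v * v

    def sse(a, b):
        """Squared error of a single bucket covering x[a:b]."""
        t = s[b] - s[a]
        return (q[b] - q[a]) - t * t / (b - a)

    def solve(lo, hi, budget):
        m = hi - lo
        if m == 0:
            return []
        if budget == 1 or m == 1:
            return [(lo, hi)]
        mid = (m + 1) // 2                       # local position of the tracked element
        E = [0.0] + [sse(lo, lo + i) for i in range(1, m + 1)]
        # cross[i] = (start, end, buckets before) of the bucket containing `mid`
        # in the best solution for the first i elements; defined for i >= mid.
        cross = [None] * (m + 1)
        for i in range(mid, m + 1):
            cross[i] = (1, i, 0)
        for b in range(2, budget + 1):
            new_E, new_cross = E[:], cross[:]
            for i in range(2, m + 1):
                for j in range(1, i):
                    cand = E[j] + sse(lo + j, lo + i)
                    if cand < new_E[i]:
                        new_E[i] = cand
                        if i >= mid:
                            new_cross[i] = (j + 1, i, b - 1) if j < mid else cross[j]
            E, cross = new_E, new_cross
        a, e, c = cross[m]
        # Left part first, then the crossing bucket, then the right part,
        # so buckets are produced in left-to-right order.
        return (solve(lo, lo + a - 1, c)
                + [(lo + a - 1, lo + e)]
                + solve(lo + e, hi, budget - c - 1))

    return solve(0, n, B)

print(vopt_buckets([1, 1, 5, 5, 9, 9], 3))   # [(0, 2), (2, 4), (4, 6)]
```

Each recursive call works on at most half of the elements of its parent, which is what makes the total recompute cost a geometric sum.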

Observe that we have kept the $E[j, b - 1]$, $E[i, b]$ notation, but we can reuse two arrays of size $n$ for this
purpose (and keep switching them as newE, E etc.); the overall space required is $O(n)$. We now know
the final solution E[n,B] and how to partition the problem. For offline approximation algorithm, when 
we recurse, we have to add the approximate error + 1, C[i] + 1] to all the elements on the right 

subproblem (since we build histograms with error increasing by 1 + e factor, this "shift" is needed). Due to 
lack of space, the details are relegated to the full version. 

4.2 Extended Wavelets 

Extended wavelets were introduced in 0. The central idea is that in case of multi-dimensional data, there 
can be significant saving of space if we use a non-standard way of storing the information. There are 
several standard ways of extending 1-dimensional (Haar) wavelets to multiple dimensions. The wavelet
basis corresponds to high-dimensional squares. But irrespective of the number of dimensions, the format
of the synopsis is a pair of numbers (coefficient index, value). In Extended Wavelets we perform wavelet
decomposition independently in each dimension but then we store tuples consisting of the coefficient index,
a bitmap indicating the dimensions for which the coefficient in that dimension is chosen, and a list of values.
Since the coefficient number and the bitmap are shared across the coefficients, we can store more coefficients
than a simple union of unidimensional transforms.

Notice that there is no interaction between the benefits of storing coefficient i and i'. The problem 
reduces naturally to a Knapsack problem with a twist that each item (coefficient i) can be present in varying 
sizes (how many values corresponding to different dimensions are stored). However the variant also has 
a simplifying feature that the space bound is polynomially bounded, therefore allowing a simple dynamic 
program. The program estimates E[i,b] which indicates the minimum error on using at most b space and 
storing only a subset of the first i coefficients. 

The idea is relatively new, and it remains to be seen if Extended Wavelets are applied widely. But it is
an intriguing and novel idea in synopsis construction and serves as an example of the broad applicability of
the ideas in this paper. This paper also gives the first (almost) linear ($O(B)$, ignoring $M$) space algorithm in the
streaming (as well as offline) model. We present the results on the optimum algorithms below‖.

‖The input is $n$ tuples in $M$ dimensions and the total synopsis size is $B$. The papers [5], [8] contain other approximation
algorithms that are not relevant to our context. The extended version of [8] reduces Extended Wavelets to a problem similar to
V-Opt histogram construction and gives an $O(NM)$ time algorithm using dynamic programming. The ideas of this paper naturally
Algorithm Idea: We follow the previous algorithms and introduce a few small changes and a more
careful analysis. For each item $i$ we compute the best profit if $i$ is allocated size $j$. This is done in time
$O(nM \log M)$ as in [8]. For each $1 \le j \le M$ we maintain the top $B/j$ items corresponding to size $j$. For
each $j$ we can achieve this in $O(B/j)$ space and $O(n)$ running time (using details from [7]), using overall
$O(nM)$ time and $O(BM)$ space. The optimum answer uses items and sizes from this list
only. The total number of item-size pairs is $\sum_j (B/j) = O(B \log M)$.

We can sort this list in lexicographic order. Suppose item $i$ has $x_i \ge 1$ occurrences (thus $\sum_i x_i =
O(B \log M)$). The dynamic program to extend the answer to $i$ (from the item before $i$) first needs to
guess/choose which of the $x_i$ occurrences are used (or none) and compute the best solution for each space bound up to $B$. The
time taken is $c(x_i + 1)B$ at $i$, which totals to at most $2cB^2 \log M$.

We maintain an $O(B)$ array where $P[z]$ corresponds to the best profit for space $z$ up to the current $i$. For
space efficiency, for $z > B/2$ we keep track of $Q[z]$ which contains the triple $(i', r, b')$ s.t. the optimum
solution for space $z$ for current $i$ uses space $b' < B/2$ up to $i'$ and a size $r$ copy of $i'$ with $b' + r > B/2$.
In other words, it records the crossing point where we crossed $B/2$ space for that solution (which remains the same even
if we extend it later).

We now recurse with $b, b' < B/2$ on the two parts. Now each item contributes $c(x_i + 1)B/2$, adding up
to less than $cB^2 \log M$. Once again we have a geometric sum which sums up to $O(B^2 \log M)$ for the entire
recursion.

Acknowledgments: We would like to thank Hyoungmin Park and Kyuseok Shim for many interesting 
discussions. 